This report uses data from: DrivenData. (2015). Pump it Up: Data Mining the Water Table. Retrieved 10 01 2023 from https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table.
The dataset contains various parameters characterizing the operation of water pumps. The overall goal is to predict the status of each water pump, distinguishing those that are functional, those that need some repair, and those that do not work.
The goal is to predict the operating condition of a waterpoint for each record in the dataset.
The following packages are used for the analysis:
#-----------------------------------------
# Library loading
rm(list = ls())
suppressPackageStartupMessages({
library(data.table)
library(dplyr)
library(caret)
library(scales)
library(ggplot2)
library(stringi)
library(stringr)
library(dataPreparation)
library(knitr)
library(kableExtra)
library(ggpubr)
library(tictoc)
library(ggeasy)
library(lubridate)
library(inspectdf)
library(questionr) # freq
library(gbm)
})
This section carries out an initial analysis of the dataset and fits a baseline model, with the aim of later improving both the dataset, via basic feature engineering techniques, and the models, via tuning. Note that this project was not submitted in time to enter the competition.
We quickly check the data for duplicates; any duplicates found would be removed.
## Number of duplicates in the predictors: 0
## Number of duplicates in the labels: 0
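The check itself is not shown in the knitted output; a minimal base-R sketch of the pattern (illustrated on a toy table; in the report it would be applied to the predictors and to the labels):

```r
# duplicated() flags fully repeated rows; unique() would drop them
toy <- data.frame(id = c(1, 2, 1), status = c("functional", "dry", "functional"))
sum(duplicated(toy))   # 1 repeated row
nrow(unique(toy))      # 2 distinct rows remain
```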
The number of variables in the problem:
## The number of fields is: 40
We proceed to clean the data and inspect the classes of the input set. The id column is removed, as it provides no information.
We verify that the datasets do contain the number of columns stated in the problem description. The structure of the dataset is as follows:
## Classes 'data.table' and 'data.frame': 59400 obs. of 40 variables:
## $ amount_tsh : num 6000 0 25 0 0 20 0 0 0 0 ...
## $ date_recorded : IDate, format: "2011-03-14" "2013-03-06" ...
## $ funder : chr "Roman" "Grumeti" "Lottery Club" "Unicef" ...
## $ gps_height : int 1390 1399 686 263 0 0 0 0 0 0 ...
## $ installer : chr "Roman" "GRUMETI" "World vision" "UNICEF" ...
## $ longitude : num 34.9 34.7 37.5 38.5 31.1 ...
## $ latitude : num -9.86 -2.15 -3.82 -11.16 -1.83 ...
## $ wpt_name : chr "none" "Zahanati" "Kwa Mahundi" "Zahanati Ya Nanyumbu" ...
## $ num_private : int 0 0 0 0 0 0 0 0 0 0 ...
## $ basin : chr "Lake Nyasa" "Lake Victoria" "Pangani" "Ruvuma / Southern Coast" ...
## $ subvillage : chr "Mnyusi B" "Nyamara" "Majengo" "Mahakamani" ...
## $ region : chr "Iringa" "Mara" "Manyara" "Mtwara" ...
## $ region_code : int 11 20 21 90 18 4 17 17 14 18 ...
## $ district_code : int 5 2 4 63 1 8 3 3 6 1 ...
## $ lga : chr "Ludewa" "Serengeti" "Simanjiro" "Nanyumbu" ...
## $ ward : chr "Mundindi" "Natta" "Ngorika" "Nanyumbu" ...
## $ population : int 109 280 250 58 0 1 0 0 0 0 ...
## $ public_meeting : logi TRUE NA TRUE TRUE TRUE TRUE ...
## $ recorded_by : chr "GeoData Consultants Ltd" "GeoData Consultants Ltd" "GeoData Consultants Ltd" "GeoData Consultants Ltd" ...
## $ scheme_management : chr "VWC" "Other" "VWC" "VWC" ...
## $ scheme_name : chr "Roman" "" "Nyumba ya mungu pipe scheme" "" ...
## $ permit : logi FALSE TRUE TRUE TRUE TRUE TRUE ...
## $ construction_year : int 1999 2010 2009 1986 0 2009 0 0 0 0 ...
## $ extraction_type : chr "gravity" "gravity" "gravity" "submersible" ...
## $ extraction_type_group: chr "gravity" "gravity" "gravity" "submersible" ...
## $ extraction_type_class: chr "gravity" "gravity" "gravity" "submersible" ...
## $ management : chr "vwc" "wug" "vwc" "vwc" ...
## $ management_group : chr "user-group" "user-group" "user-group" "user-group" ...
## $ payment : chr "pay annually" "never pay" "pay per bucket" "never pay" ...
## $ payment_type : chr "annually" "never pay" "per bucket" "never pay" ...
## $ water_quality : chr "soft" "soft" "soft" "soft" ...
## $ quality_group : chr "good" "good" "good" "good" ...
## $ quantity : chr "enough" "insufficient" "enough" "dry" ...
## $ quantity_group : chr "enough" "insufficient" "enough" "dry" ...
## $ source : chr "spring" "rainwater harvesting" "dam" "machine dbh" ...
## $ source_type : chr "spring" "rainwater harvesting" "dam" "borehole" ...
## $ source_class : chr "groundwater" "surface" "surface" "groundwater" ...
## $ waterpoint_type : chr "communal standpipe" "communal standpipe" "communal standpipe multiple" "communal standpipe multiple" ...
## $ waterpoint_type_group: chr "communal standpipe" "communal standpipe" "communal standpipe" "communal standpipe" ...
## $ status_group : chr "functional" "functional" "functional" "non functional" ...
## - attr(*, ".internal.selfref")=<externalptr>
From the structure we can see:

- The variable date_recorded was correctly assigned the Date type, although it appears to have too many distinct values and will need transforming.
- Three variables were incorrectly assigned the int type.
- Two variables were interpreted as logical values; they will be converted to character.
- status_group is of type character.
We therefore make a copy of the dataset and call it datMod. The changes will be applied to datMod (train) and predict_x (test). The original variables from which transformed ones are derived are removed, to reduce the risk of strong correlation/association between variables and, with it, overfitting. The type of the target variable (status_group) is changed to factor.
datMod <- copy(datIn)
datMod$region_code <- as.character(datMod$region_code)
datMod$district_code <- as.character(datMod$district_code)
datMod$construction_year <- as.character(datMod$construction_year)
datMod$permit <- as.character(as.integer(datMod$permit))
datMod$public_meeting <- as.character(as.integer(datMod$public_meeting))
predict_x$region_code <- as.character(predict_x$region_code)
predict_x$district_code <- as.character(predict_x$district_code)
predict_x$construction_year <- as.character(predict_x$construction_year)
predict_x$permit <- as.character(as.integer(predict_x$permit))
predict_x$public_meeting <- as.character(as.integer(predict_x$public_meeting))
datMod$status_group <- as.factor(datMod$status_group)
str(datMod)
## Classes 'data.table' and 'data.frame': 59400 obs. of 40 variables:
## $ amount_tsh : num 6000 0 25 0 0 20 0 0 0 0 ...
## $ date_recorded : IDate, format: "2011-03-14" "2013-03-06" ...
## $ funder : chr "Roman" "Grumeti" "Lottery Club" "Unicef" ...
## $ gps_height : int 1390 1399 686 263 0 0 0 0 0 0 ...
## $ installer : chr "Roman" "GRUMETI" "World vision" "UNICEF" ...
## $ longitude : num 34.9 34.7 37.5 38.5 31.1 ...
## $ latitude : num -9.86 -2.15 -3.82 -11.16 -1.83 ...
## $ wpt_name : chr "none" "Zahanati" "Kwa Mahundi" "Zahanati Ya Nanyumbu" ...
## $ num_private : int 0 0 0 0 0 0 0 0 0 0 ...
## $ basin : chr "Lake Nyasa" "Lake Victoria" "Pangani" "Ruvuma / Southern Coast" ...
## $ subvillage : chr "Mnyusi B" "Nyamara" "Majengo" "Mahakamani" ...
## $ region : chr "Iringa" "Mara" "Manyara" "Mtwara" ...
## $ region_code : chr "11" "20" "21" "90" ...
## $ district_code : chr "5" "2" "4" "63" ...
## $ lga : chr "Ludewa" "Serengeti" "Simanjiro" "Nanyumbu" ...
## $ ward : chr "Mundindi" "Natta" "Ngorika" "Nanyumbu" ...
## $ population : int 109 280 250 58 0 1 0 0 0 0 ...
## $ public_meeting : chr "1" NA "1" "1" ...
## $ recorded_by : chr "GeoData Consultants Ltd" "GeoData Consultants Ltd" "GeoData Consultants Ltd" "GeoData Consultants Ltd" ...
## $ scheme_management : chr "VWC" "Other" "VWC" "VWC" ...
## $ scheme_name : chr "Roman" "" "Nyumba ya mungu pipe scheme" "" ...
## $ permit : chr "0" "1" "1" "1" ...
## $ construction_year : chr "1999" "2010" "2009" "1986" ...
## $ extraction_type : chr "gravity" "gravity" "gravity" "submersible" ...
## $ extraction_type_group: chr "gravity" "gravity" "gravity" "submersible" ...
## $ extraction_type_class: chr "gravity" "gravity" "gravity" "submersible" ...
## $ management : chr "vwc" "wug" "vwc" "vwc" ...
## $ management_group : chr "user-group" "user-group" "user-group" "user-group" ...
## $ payment : chr "pay annually" "never pay" "pay per bucket" "never pay" ...
## $ payment_type : chr "annually" "never pay" "per bucket" "never pay" ...
## $ water_quality : chr "soft" "soft" "soft" "soft" ...
## $ quality_group : chr "good" "good" "good" "good" ...
## $ quantity : chr "enough" "insufficient" "enough" "dry" ...
## $ quantity_group : chr "enough" "insufficient" "enough" "dry" ...
## $ source : chr "spring" "rainwater harvesting" "dam" "machine dbh" ...
## $ source_type : chr "spring" "rainwater harvesting" "dam" "borehole" ...
## $ source_class : chr "groundwater" "surface" "surface" "groundwater" ...
## $ waterpoint_type : chr "communal standpipe" "communal standpipe" "communal standpipe multiple" "communal standpipe multiple" ...
## $ waterpoint_type_group: chr "communal standpipe" "communal standpipe" "communal standpipe" "communal standpipe" ...
## $ status_group : Factor w/ 3 levels "functional","functional needs repair",..: 1 1 1 3 1 1 3 3 3 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
Frequencies of the categorical variables
Correlations
From the exploratory plots we can conclude:

- Categorical variables predominate. Tree-based models can therefore be useful here, as they can reach a high degree of predictive accuracy.
- The quantitative variables have no correlations with absolute values above 0.95, which reduces the risk of overfitting. At first glance, they do not appear to contain out-of-range values.
- Some variables have very unbalanced or small category frequencies, which can lead to overfitting and high computational cost.
- The variables funder, installer, lga, scheme_name, subvillage, ward and wpt_name have many categories (>20).
- The variable date_recorded was correctly assigned the Date type, although it has too many distinct values and will need transforming.
- The variables permit and public_meeting have missing values in more than 5% of the records.
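The 5% missing-value figure can be verified with a short base-R sketch (the helper name and the toy columns are illustrative, not the report's code):

```r
# Share of missing values per column, compared against a 5% threshold
na_share <- function(df) colMeans(is.na(df))

toy <- data.frame(permit         = c(TRUE, NA, FALSE, NA, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE),
                  public_meeting = c(TRUE, TRUE, NA, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE))
na_share(toy)                     # permit 0.20, public_meeting 0.10
names(toy)[na_share(toy) > 0.05]  # both columns exceed the 5% threshold
```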
The missing values are reclassified into a 'Desconocido' (unknown) category.
datMod$fe_permit<-car::recode(datMod$permit,"NA='Desconocido'")
datMod$fe_public_meeting<-car::recode(datMod$public_meeting,"NA='Desconocido'")
datMod$permit <- NULL
datMod$public_meeting <- NULL
predict_x$fe_permit<-car::recode(predict_x$permit,"NA='Desconocido'")
predict_x$fe_public_meeting<-car::recode(predict_x$public_meeting,"NA='Desconocido'")
predict_x$permit <- NULL
predict_x$public_meeting <- NULL
datMod$num_private <- NULL
datMod$recorded_by <- NULL
predict_x$num_private <- NULL
predict_x$recorded_by <- NULL
We check whether amount_tsh and population have fewer than 10 categories. If so, their type would be changed to categorical. Their value frequencies are analyzed.
Variable amount_tsh: more than 70% of its values are zero, which would indicate that those waterpoints are dry; we assume this is real information rather than missing data. The variable is therefore kept.
Variable population: in 36% of the waterpoints the surrounding area is uninhabited, so, in line with the minimal-transformation policy for this first iteration, this variable is not modified.
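Both percentages refer to the share of exact zeros in each variable; a sketch of that check (the helper name is illustrative; the sample values are the first ten amount_tsh entries shown in the structure above):

```r
# Proportion of exact zeros in a numeric vector
zero_share <- function(x) mean(x == 0, na.rm = TRUE)

zero_share(c(6000, 0, 25, 0, 0, 20, 0, 0, 0, 0))  # 0.7
# In the report: zero_share(datMod$amount_tsh) is above 0.70,
# and zero_share(datMod$population) is roughly 0.36
```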
We confirm that longitude has more than 10 distinct values, something the EDA histogram had left in doubt.
datMod$fe_construction_year<-car::recode(datMod$construction_year,"0='Desconocido'")
datMod$construction_year <- NULL
freq(datMod$fe_construction_year, sort = 'dec')
predict_x$fe_construction_year<-car::recode(predict_x$construction_year,"0='Desconocido'")
predict_x$construction_year <- NULL
The variable date_recorded has a multitude of distinct values that 1) increase computation time and 2) increase the risk of overfitting. New variables will be created from it and it will then be removed.
datMod$fe_anio <- year(datMod$date_recorded)
datMod$fe_mes <- month(datMod$date_recorded)
datMod$fe_dianum <- day(datMod$date_recorded)
datMod$fe_diasem <- wday(datMod$date_recorded, label = TRUE, abbr = TRUE)
datMod$date_recorded <- NULL
predict_x$fe_anio <- year(predict_x$date_recorded)
predict_x$fe_mes <- month(predict_x$date_recorded)
predict_x$fe_dianum <- day(predict_x$date_recorded)
predict_x$fe_diasem <- wday(predict_x$date_recorded, label = TRUE, abbr = TRUE)
predict_x$date_recorded <- NULL
Other categorical variables also have many categories, as can be seen in the EDA and in the following table. A lumping (grouping) encoding will be applied to those with the largest numbers of categories.
freq.input.cat <- sapply(Filter(is.character, datMod),function(x) length(unique(x)))
freq.input.cat <- sort(freq.input.cat, decreasing = TRUE)
kable(freq.input.cat)

| | x |
|---|---|
| wpt_name | 37400 |
| subvillage | 19288 |
| scheme_name | 2697 |
| installer | 2146 |
| ward | 2092 |
| funder | 1898 |
| lga | 125 |
| fe_construction_year | 55 |
| region_code | 27 |
| region | 21 |
| district_code | 20 |
| extraction_type | 18 |
| scheme_management | 13 |
| extraction_type_group | 13 |
| management | 12 |
| source | 10 |
| basin | 9 |
| water_quality | 8 |
| extraction_type_class | 7 |
| payment | 7 |
| payment_type | 7 |
| source_type | 7 |
| waterpoint_type | 7 |
| quality_group | 6 |
| waterpoint_type_group | 6 |
| management_group | 5 |
| quantity | 5 |
| quantity_group | 5 |
| source_class | 3 |
| fe_permit | 3 |
| fe_public_meeting | 3 |
Lumping is applied to the first 6 variables, which have thousands of categories. For those 6 variables, categories with a frequency below 5% are grouped together. If no category exceeds 5%, the variable is dropped. Naturally, all transformations applied to the train set must also be applied to the test set.
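The rule just described can be written as one reusable helper (a sketch; the report instead applies car::recode variable by variable, as the following chunks show):

```r
# Group every category whose relative frequency is below `threshold`
lump_rare <- function(x, threshold = 0.05, other = "Otros") {
  freqs <- table(x) / length(x)
  rare  <- names(freqs)[freqs < threshold]
  x[x %in% rare] <- other
  x
}

v <- c(rep("DWE", 90), rep("Roman", 6), rep("Grumeti", 4))
table(lump_rare(v))  # DWE 90, Roman 6, Otros 4 (Grumeti is below 5%)
```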
The variable wpt_name is reduced to two categories, 'named' and 'none'. Both datasets show very similar frequencies, which suggests (though does not prove) that they come from the same sample.
# categories other than 'none'
wpt.name.labels <- unique(datMod$wpt_name)
wpt.name.labels <- wpt.name.labels[!(wpt.name.labels %in% 'none')]
datMod$fe_wpt_name <-car::recode(datMod$wpt_name, "wpt.name.labels = 'named'")
datMod$wpt_name <- NULL
# recompute the labels because the test set contains levels not present in the train set
wpt.name.labels <- unique(predict_x$wpt_name)
wpt.name.labels <- wpt.name.labels[!(wpt.name.labels %in% 'none')]
predict_x$fe_wpt_name <-car::recode(predict_x$wpt_name, "wpt.name.labels = 'named'")
predict_x$wpt_name <- NULL
freq(datMod$fe_wpt_name, sort = 'dec')
The variable subvillage is dropped, since none of its categories has a frequency above 5%.
The large number of missing values in scheme_name had been overlooked. As they do not exceed 50% of the data, they are renamed 'Desconocido' and the remaining values are grouped as 'Conocido'. Once again, the train and test sets show similar frequencies after renaming.
scheme_name.labels <- unique(datMod$scheme_name)
scheme_name.labels <- scheme_name.labels[!(scheme_name.labels %in% "")]
datMod$fe_scheme_name <-car::recode(datMod$scheme_name, "scheme_name.labels = 'Conocido'")
datMod$fe_scheme_name <-car::recode(datMod$fe_scheme_name, "'' = 'Desconocido'")
datMod$scheme_name <- NULL
scheme_name.labels <- unique(predict_x$scheme_name)
scheme_name.labels <- scheme_name.labels[!(scheme_name.labels %in% "")]
predict_x$fe_scheme_name <-car::recode(predict_x$scheme_name, "scheme_name.labels = 'Conocido'")
predict_x$fe_scheme_name <-car::recode(predict_x$fe_scheme_name, "'' = 'Desconocido'")
predict_x$scheme_name <- NULL
freq(datMod$fe_scheme_name, sort = 'dec')
In the variable installer we group the minority categories into 'Otros' and rename the missing values not detected earlier, which exceed 5%.
installer.labels <- unique(datMod$installer)
installer.labels <- installer.labels[!(installer.labels %in% c('','DWE'))]
datMod$fe_installer <-car::recode(datMod$installer, "installer.labels = 'Otros'")
datMod$fe_installer <-car::recode(datMod$fe_installer, "'' = 'Desconocido'")
datMod$installer <- NULL
freq(datMod$fe_installer, sort = 'dec')
installer.labels <- unique(predict_x$installer)
installer.labels <- installer.labels[!(installer.labels %in% c('','DWE'))]
predict_x$fe_installer <-car::recode(predict_x$installer, "installer.labels = 'Otros'")
predict_x$fe_installer <-car::recode(predict_x$fe_installer, "'' = 'Desconocido'")
predict_x$installer <- NULL
freq(predict_x$fe_installer, sort = 'dec')
The variable ward is dropped, as it is redundant with the variable lga, which also indicates geographic location.
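The deletion itself is not shown; dropping a column is a one-line NULL assignment (illustrated on a toy table; in the report it would be applied to ward in both datMod and predict_x):

```r
# Assigning NULL removes a column from a data.frame or data.table
toy <- data.frame(ward = c("Mundindi", "Natta"), lga = c("Ludewa", "Serengeti"))
toy$ward <- NULL
names(toy)  # only "lga" remains
```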
The variable funder is transformed.
funder.labels <- unique(datMod$funder)
funder.labels <- funder.labels[!(funder.labels %in% c('','Government Of Tanzania','Danida'))]
datMod$fe_funder <-car::recode(datMod$funder, "funder.labels = 'Otros'")
datMod$fe_funder <-car::recode(datMod$fe_funder, "'' = 'Desconocido'")
freq(datMod$fe_funder, sort = 'dec')
datMod$funder <- NULL
funder.labels <- unique(predict_x$funder)
funder.labels <- funder.labels[!(funder.labels %in% c('','Government Of Tanzania','Danida'))]
predict_x$fe_funder <-car::recode(predict_x$funder, "funder.labels = 'Otros'")
predict_x$fe_funder <-car::recode(predict_x$fe_funder, "'' = 'Desconocido'")
freq(predict_x$fe_funder, sort = 'dec')
Only two algorithms are evaluated, ranger (random forest) and gbm (boosting), due to time constraints. Models are built from the inputs modified in this first iteration, datMod, and evaluated on the equally transformed test set predict_x.
datMod is split into a train set and a validation set, with proportions of 80% and 20% respectively. Models are trained on train and evaluated on validation. The models with the highest validation accuracy will subsequently be tried on the test set.
Splitting the train data into two subsets has the disadvantage of training on a smaller population, but it provides a completely independent sample on which to evaluate before uploading predictions to the website, where only 3 submissions per day are allowed. Without that restriction, no validation set would have been created. In any case, given the large number of samples provided, setting aside 20% should not cause a strong accuracy penalty.
Due to deadline and computational constraints, two resampling techniques are evaluated: none (a single fold) and stratified k-folds. The 'none' run serves to estimate the computation times required, which conditions the affordable number of folds and repetitions. In addition, since some variables have many categories, it could also help reduce overfitting in this particular problem. No grid search over hyperparameters is applied.
The caret function createDataPartition is used to perform the stratified k-folds resampling (5 folds) with shuffling.
No repetitions are made with any model in this first iteration.
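A base-R sketch of the stratified k-folds idea (an illustration, not the report's caret call): shuffle the rows of each class, then deal fold ids round-robin so every fold keeps the class proportions.

```r
set.seed(2023)
# Stratified k-fold assignment: shuffle within each class, then assign
# fold ids round-robin so each fold preserves the class proportions
stratified_folds <- function(y, k = 5) {
  fold_id <- integer(length(y))
  for (cl in unique(y)) {
    idx <- sample(which(y == cl))  # shuffle the rows of this class
    fold_id[idx] <- rep_len(seq_len(k), length(idx))
  }
  fold_id
}

# Toy outcome, imbalanced like the pump classes (60% / 8% / 32%)
y <- rep(c("functional", "functional needs repair", "non functional"),
         times = c(300, 40, 160))
fold_id <- stratified_folds(y, k = 5)
table(fold_id)          # 100 observations per fold
table(y[fold_id == 1])  # 60 / 8 / 32: the overall class mix is preserved
```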
We start with ranger, using parameter values that are either defaults or 'common' choices, as used in code by instructors of the Data Science master's program the author is taking. No parametric grid search is performed for any algorithm in this first iteration.
For the sake of consistency, parameter values shared across the algorithms are kept equal. Obviously this consistency is imperfect, since the algorithms differ. For example, num.trees = 100 for both ranger and gbm.
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
library(ranger)
library(tictoc)
library(caret)
library(ggplot2)
library(lattice)
# registerDoMC(2)
# Split out validation dataset
# create a list of 80% of the rows in the original dataset we can use for training
set.seed(2023)
validationIndex <- createDataPartition(datMod$status_group, p=0.80, list=FALSE)
# select 20% of the data for validation
my_test <- datMod[-validationIndex,]
# use the remaining 80% of the data for training
my_train <- datMod[validationIndex,]
tic()
set.seed(2023)
fit_r1 <- ranger(
status_group ~. ,
data = my_train,
num.trees = 100,
importance = 'impurity',
write.forest = TRUE,
min.node.size = 1,
splitrule = "gini",
verbose = TRUE,
classification = TRUE
)
toc()
## 14.464 sec elapsed
The run time is around 15 seconds; this gives an idea of how exhaustive we can afford to be in future evaluations with (stratified) cv and grid search.
## Ranger result
##
## Call:
## ranger(status_group ~ ., data = my_train, num.trees = 100, importance = "impurity", write.forest = TRUE, min.node.size = 1, splitrule = "gini", verbose = TRUE, classification = TRUE)
##
## Type: Classification
## Number of trees: 100
## Sample size: 47522
## Number of independent variables: 38
## Mtry: 6
## Target node size: 1
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error: 18.93 %
## Length Class Mode
## predictions 47522 factor numeric
## num.trees 1 -none- numeric
## num.independent.variables 1 -none- numeric
## mtry 1 -none- numeric
## min.node.size 1 -none- numeric
## variable.importance 38 -none- numeric
## prediction.error 1 -none- numeric
## forest 10 ranger.forest list
## confusion.matrix 9 table numeric
## splitrule 1 -none- character
## treetype 1 -none- character
## call 10 -none- call
## importance.mode 1 -none- character
## num.samples 1 -none- numeric
## replace 1 -none- logical
Variable importance
vars_imp <- fit_r1$variable.importance
vars_imp <- as.data.frame(vars_imp)
vars_imp$myvar <- rownames(vars_imp)
library(ggpubr)
ggbarplot(vars_imp[1:10,],
x = "myvar", y = "vars_imp",
#fill = 'myvar',
color = "blue", # Set bar border color
palette = "jco", # jco journal color palette; see ?ggpar
sort.val = "asc", # Sort values in ascending order
sort.by.groups = FALSE, # Don't sort inside each group
x.text.angle = 90, # Rotate x axis texts vertically
ylab = "Importancia",
xlab = 'Variable',
#legend.title = "MPG Group",
rotate = TRUE,
ggtheme = theme_minimal(),
main = "ranger con dataMod"
)
Validation on the held-out validation split (my_test).
validation.fit_r1 <- predict(fit_r1, data = my_test)
confusionMatrix( my_test$status_group, validation.fit_r1$predictions)
## Confusion Matrix and Statistics
##
## Reference
## Prediction functional functional needs repair non functional
## functional 5790 136 525
## functional needs repair 458 277 128
## non functional 954 87 3523
##
## Overall Statistics
##
## Accuracy : 0.8074
## 95% CI : (0.8002, 0.8144)
## No Information Rate : 0.6063
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6383
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: functional Class: functional needs repair
## Sensitivity 0.8039 0.55400
## Specificity 0.8586 0.94850
## Pos Pred Value 0.8975 0.32097
## Neg Pred Value 0.7398 0.97975
## Prevalence 0.6063 0.04209
## Detection Rate 0.4875 0.02332
## Detection Prevalence 0.5431 0.07266
## Balanced Accuracy 0.8313 0.75125
## Class: non functional
## Sensitivity 0.8436
## Specificity 0.8648
## Pos Pred Value 0.7719
## Neg Pred Value 0.9107
## Prevalence 0.3516
## Detection Rate 0.2966
## Detection Prevalence 0.3842
## Balanced Accuracy 0.8542
After this first iteration, an accuracy of 80% (0.8074) was obtained. The OOB prediction error is 18.93%. We store the predictions in a csv file to be used as input on the competition platform:
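The export code is not included in the output; a sketch of the expected two-column format (the object name test_ids and the file name are illustrative; DrivenData expects columns id and status_group):

```r
# Submission format: one row per test record, columns `id` and `status_group`.
# In the report, roughly:
#   pred_test  <- predict(fit_r1, data = predict_x)
#   submission <- data.frame(id = test_ids, status_group = pred_test$predictions)
# Self-contained round trip on two toy rows:
submission <- data.frame(id = c(50785, 51630),
                         status_group = c("functional", "non functional"))
f <- tempfile(fileext = ".csv")
write.csv(submission, f, row.names = FALSE)
nrow(read.csv(f))  # 2 rows written and read back
```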
The first iteration consisted of running models with as few transformations as possible, to avoid excessive overfitting and computation time. In iteration 2, three extra transformations are applied to the set of input variables:
Less aggressive limits are applied than in the first iteration, since the number of categories was already reduced in iteration 1 and the number of observations is fairly large. The limit is 2%, both for regrouping categories and for flagging a value as an outlier.
First, some of the variables derived from the date are converted to categorical, since their values are assumed to be discrete. The suffix '2' added when renaming these variables indicates the iteration in which they were transformed.
The train/test sets derived from this second iteration are called datMod2 and predict_x2 respectively.
datMod2 <- copy(datMod)
predict_x2 <- copy(predict_x)
datMod2$fe_anio2 <- as.character(datMod2$fe_anio)
datMod2$fe_mes2 <- as.character(datMod2$fe_mes)
datMod2$fe_dianum2 <- as.character(datMod2$fe_dianum)
datMod2$fe_anio <- NULL
datMod2$fe_mes <- NULL
datMod2$fe_dianum <- NULL
predict_x2$fe_anio2 <- as.character(predict_x2$fe_anio)
predict_x2$fe_mes2 <- as.character(predict_x2$fe_mes)
predict_x2$fe_dianum2 <- as.character(predict_x2$fe_dianum)
predict_x2$fe_anio <- NULL
predict_x2$fe_mes <- NULL
predict_x2$fe_dianum <- NULL
str(datMod2)
## Classes 'data.table' and 'data.frame': 59400 obs. of 39 variables:
## $ amount_tsh : num 6000 0 25 0 0 20 0 0 0 0 ...
## $ gps_height : int 1390 1399 686 263 0 0 0 0 0 0 ...
## $ longitude : num 34.9 34.7 37.5 38.5 31.1 ...
## $ latitude : num -9.86 -2.15 -3.82 -11.16 -1.83 ...
## $ basin : chr "Lake Nyasa" "Lake Victoria" "Pangani" "Ruvuma / Southern Coast" ...
## $ region : chr "Iringa" "Mara" "Manyara" "Mtwara" ...
## $ region_code : chr "11" "20" "21" "90" ...
## $ district_code : chr "5" "2" "4" "63" ...
## $ lga : chr "Ludewa" "Serengeti" "Simanjiro" "Nanyumbu" ...
## $ population : int 109 280 250 58 0 1 0 0 0 0 ...
## $ scheme_management : chr "VWC" "Other" "VWC" "VWC" ...
## $ extraction_type : chr "gravity" "gravity" "gravity" "submersible" ...
## $ extraction_type_group: chr "gravity" "gravity" "gravity" "submersible" ...
## $ extraction_type_class: chr "gravity" "gravity" "gravity" "submersible" ...
## $ management : chr "vwc" "wug" "vwc" "vwc" ...
## $ management_group : chr "user-group" "user-group" "user-group" "user-group" ...
## $ payment : chr "pay annually" "never pay" "pay per bucket" "never pay" ...
## $ payment_type : chr "annually" "never pay" "per bucket" "never pay" ...
## $ water_quality : chr "soft" "soft" "soft" "soft" ...
## $ quality_group : chr "good" "good" "good" "good" ...
## $ quantity : chr "enough" "insufficient" "enough" "dry" ...
## $ quantity_group : chr "enough" "insufficient" "enough" "dry" ...
## $ source : chr "spring" "rainwater harvesting" "dam" "machine dbh" ...
## $ source_type : chr "spring" "rainwater harvesting" "dam" "borehole" ...
## $ source_class : chr "groundwater" "surface" "surface" "groundwater" ...
## $ waterpoint_type : chr "communal standpipe" "communal standpipe" "communal standpipe multiple" "communal standpipe multiple" ...
## $ waterpoint_type_group: chr "communal standpipe" "communal standpipe" "communal standpipe" "communal standpipe" ...
## $ status_group : Factor w/ 3 levels "functional","functional needs repair",..: 1 1 1 3 1 1 3 3 3 1 ...
## $ fe_permit : chr "0" "1" "1" "1" ...
## $ fe_public_meeting : chr "1" "Desconocido" "1" "1" ...
## $ fe_construction_year : chr "1999" "2010" "2009" "1986" ...
## $ fe_diasem : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 2 4 2 2 4 1 2 3 7 4 ...
## $ fe_wpt_name : chr "none" "named" "named" "named" ...
## $ fe_scheme_name : chr "Conocido" "Desconocido" "Conocido" "Desconocido" ...
## $ fe_installer : chr "Otros" "Otros" "Otros" "Otros" ...
## $ fe_funder : chr "Otros" "Otros" "Otros" "Otros" ...
## $ fe_anio2 : chr "2011" "2013" "2013" "2013" ...
## $ fe_mes2 : chr "3" "3" "2" "1" ...
## $ fe_dianum2 : chr "14" "6" "25" "28" ...
## - attr(*, ".internal.selfref")=<externalptr>
The variables amount_tsh and population contain some very high values, judging by the median and mean of the train set. We check whether these are outliers or legitimately 'large values', based on their share of the train set.
This check is also extended, with a lower risk of finding outliers, to the remaining numeric variables.
## [1] "amount_tsh"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 317.7 20.0 350000.0
## [1] "population"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 25.0 179.9 215.0 30500.0
## [1] "gps_height"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -90.0 0.0 369.0 668.3 1319.2 2770.0
## [1] "longitude"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 33.09 34.91 34.08 37.18 40.35
## [1] "latitude"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -11.649 -8.541 -5.022 -5.706 -3.326 0.000
Boxplots standardized by the maximum value, to help visualize the proportion of possible outliers.
## [1] 18.78956
## [1] 7.378788
## [1] 0
The variables amount_tsh and population show a large share of big values, above 5%, so they cannot all simply be deleted. However, the standardized boxplots reveal a few very large observations that could harm the models.
We therefore evaluate how the boxplots look after setting values above the 98th percentile to NA and imputing them with the train medians. These boxplots are normalized by the maximum values of the dataset resulting from iteration 1, datMod.
As mentioned earlier, the median values of datMod2 are applied to predict_x2 for the imputation.
upper.bound.amount_tsh <- quantile(datMod2$amount_tsh, 0.98)
median.amount_tsh <- median(datMod2$amount_tsh, na.rm = TRUE)
datMod2$fe_amount_tsh2 <- copy(datMod2$amount_tsh)
datMod2$fe_amount_tsh2[datMod2$fe_amount_tsh2>upper.bound.amount_tsh] <- NA
datMod2$fe_amount_tsh2[is.na(datMod2$fe_amount_tsh2)] <- median.amount_tsh
boxplot(datMod2$fe_amount_tsh2/max(datMod2$amount_tsh),main="boxplot fe_amount_tsh2")
datMod2$amount_tsh <- NULL
upper.bound.population <- quantile(datMod2$population, 0.98)
median.population <- median(datMod2$population, na.rm = TRUE)
datMod2$fe_population2 <- copy(datMod2$population)
datMod2$fe_population2[datMod2$fe_population2>upper.bound.population] <- NA
datMod2$fe_population2[is.na(datMod2$fe_population2)] <- median.population
boxplot(datMod2$fe_population2/max(datMod2$population),main="boxplot fe_population2")
datMod2$population <- NULL
predict_x2$fe_amount_tsh2 <- copy(predict_x2$amount_tsh)
predict_x2$fe_amount_tsh2[predict_x2$fe_amount_tsh2>upper.bound.amount_tsh] <- NA
predict_x2$fe_amount_tsh2[is.na(predict_x2$fe_amount_tsh2)] <- median.amount_tsh
boxplot(predict_x2$fe_amount_tsh2/max(predict_x2$amount_tsh),main="boxplot fe_amount_tsh2")
predict_x2$amount_tsh <- NULL
# reuse upper.bound.population and median.population computed on the train set (datMod2), as stated above
predict_x2$fe_population2 <- copy(predict_x2$population)
predict_x2$fe_population2[predict_x2$fe_population2>upper.bound.population] <- NA
predict_x2$fe_population2[is.na(predict_x2$fe_population2)] <- median.population
boxplot(predict_x2$fe_population2/max(predict_x2$population),main="boxplot fe_population2")
The data are noticeably more compact, comparing the y-axis scale with that of the previous boxplots.
Before applying the next transformations, datMod2 with the outlier treatment already performed is saved into a new variable (datMod2_num) for iteration 3. This makes it possible to compare the effect of removing outliers independently.
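The snapshot needs data.table::copy rather than a plain assignment, because data.tables are modified by reference: after x2 <- x1, in-place updates to x1 would also change x2. A minimal sketch:

```r
library(data.table)

dt <- data.table(x = 1:3)
snapshot <- copy(dt)   # deep copy: later by-reference edits to dt do not leak
dt[, x := x * 10L]
snapshot$x             # still 1 2 3
# In the report: datMod2_num <- copy(datMod2)
```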
Next, lumping is performed, grouping the categories of any variable whose frequencies fall below 2%. The day-of-month variable, fe_dianum2, is excluded from this process.
# Move status_group to the last column; this ensures predict_x2 and datMod2 share the same column numbering, which the following lines assume
datMod2 <- cbind(datMod2[,-26], datMod2[,26])
char.indices <- as.data.frame( which(sapply(datMod2, function(x) is.character(x))) )
char.indices <- char.indices[!char.indices %in% char.indices[rownames(char.indices) == "fe_dianum2", 1]]
indices.changed <- c()
for (char.indice in char.indices[,1])
{
a <- as.data.frame( freq( datMod2[ , char.indice[1] ],sort = "inc"))
#indices <- a[a[,3] < 2,]
nrows <- nrow( a[a[,3] < 2,] ) # , print (a[a[,3] < 2,2])
#print ( nrows)
if (nrows>0)
{
indices.changed <- append(indices.changed, char.indice[1] )
datMod2[,char.indice[1]] <-car::recode(datMod2[,char.indice[1]], "rownames(indices) = 'Otros'")
predict_x2[,char.indice[1]] <-car::recode(predict_x2[,char.indice[1]], "rownames(indices) = 'Otros'")
}
}
library(stringr) # already loaded at the top
for (indice.changed in indices.changed)
{
name <- colnames(datMod2)[indice.changed]
if ((substr(name,1,3) != "fe_" ) || ( name == 'lga') )
{
name <- paste("fe_", name, sep="")
}
if (str_sub(name, start= -1) != "2")
{
name <- paste( name,"2", sep="")
colnames(datMod2)[indice.changed] <- name
colnames(predict_x2)[indice.changed] <- name
}
}
A quick check of train and test: both sets contain the same number of columns and similar proportions. As expected, the chart below shows fewer categories than in iteration 1, i.e. it is less 'segmented'.
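The rare-level lumping performed above can be expressed generically: compute each level's relative frequency and collapse those under a threshold. A sketch under the same 2% threshold, assuming a character vector input — `lump_rare` is a hypothetical helper, not part of the original script:

```r
# Hypothetical helper: group levels below min.pct percent into 'Otros'.
lump_rare <- function(x, min.pct = 2) {
  pct <- 100 * table(x) / length(x)   # relative frequency of each level
  rare <- names(pct)[pct < min.pct]
  x[x %in% rare] <- "Otros"
  x
}
```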
## [1] "gps_height" "longitude" "latitude"
## [4] "basin" "region" "region_code"
## [7] "district_code" "lga" "scheme_management"
## [10] "extraction_type" "extraction_type_group" "extraction_type_class"
## [13] "management" "management_group" "payment"
## [16] "payment_type" "water_quality" "quality_group"
## [19] "quantity" "quantity_group" "source"
## [22] "source_type" "source_class" "waterpoint_type"
## [25] "waterpoint_type_group" "fe_permit" "fe_public_meeting"
## [28] "fe_construction_year" "fe_diasem" "fe_wpt_name"
## [31] "fe_scheme_name" "fe_installer" "fe_funder"
## [34] "fe_anio2" "fe_mes2" "fe_dianum2"
## [37] "fe_amount_tsh2" "fe_population2" "status_group"
## [1] "gps_height" "longitude" "latitude"
## [4] "basin" "region" "region_code"
## [7] "district_code" "lga" "scheme_management"
## [10] "extraction_type" "extraction_type_group" "extraction_type_class"
## [13] "management" "management_group" "payment"
## [16] "payment_type" "water_quality" "quality_group"
## [19] "quantity" "quantity_group" "source"
## [22] "source_type" "source_class" "waterpoint_type"
## [25] "waterpoint_type_group" "fe_permit" "fe_public_meeting"
## [28] "fe_construction_year" "fe_diasem" "fe_wpt_name"
## [31] "fe_scheme_name" "fe_installer" "fe_funder"
## [34] "fe_anio2" "fe_mes2" "fe_dianum2"
## [37] "fe_amount_tsh2" "fe_population2"
The lga field still shows too many categories, so values with a frequency below 0.2% are grouped into an 'Otros' category.
lga.labels <- c('Lindi Urban','Nyamagana','Arusha Urban','Kigoma Urban','Moshi Urban','Songea Urban','Bukoba Urban')
datMod2$fe_lga <- datMod2$lga
datMod2$fe_lga[datMod2$fe_lga %in% lga.labels] <- 'Otros'
predict_x2$fe_lga <- predict_x2$lga
predict_x2$fe_lga[predict_x2$fe_lga %in% lga.labels] <- 'Otros'
freq(datMod2$fe_lga, sort = 'dec')
Further models are added, keeping ranger with the same hyperparameters as iteration A. The resulting accuracies are:
- Ranger without resampling: 0.8097
- Ranger with stratified k-folds: 0.7617
- GBM without resampling: 1 (something is clearly wrong in this case; due to lack of time it has not been resolved)
- GBM with stratified k-folds: 0.6806
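The "stratified k-folds" variants below rely on caret's `createFolds`, which stratifies the folds by the class labels so that each fold preserves the class proportions. A minimal self-contained sketch (the toy class counts are illustrative):

```r
library(caret)

set.seed(1)
# Toy imbalanced target with the three class labels used later in this report.
y <- factor(rep(c("f", "fnr", "nf"), times = c(600, 80, 320)))
folds <- createFolds(y, k = 5, returnTrain = TRUE)

# Each training fold keeps roughly the original class proportions (0.60/0.08/0.32).
sapply(folds, function(idx) round(prop.table(table(y[idx])), 2))
```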
There is a small improvement in the ranger model without resampling, which scored 0.8074 in attempt A.
# Split out validation dataset
# create a list of 80% of the rows in the original dataset we can use for training
set.seed(2023)
validationIndex <- createDataPartition(datMod2$status_group, p=0.80, list=FALSE)
# select 20% of the data for validation
my_test2 <-datMod2[-validationIndex,]
# use the remaining 80% of the data for training
my_train2 <-datMod2[validationIndex,]
tic()
set.seed(1)
fit_r2 <- ranger(
status_group ~. ,
data = my_train2,
num.trees = fit_r1$num.trees,
importance = 'impurity',
write.forest = TRUE,
min.node.size = fit_r1$min.node.size,
splitrule = "gini",
verbose = TRUE,
classification = TRUE
)
toc()
## 14.126 sec elapsed
## Ranger result
##
## Call:
## ranger(status_group ~ ., data = my_train2, num.trees = fit_r1$num.trees, importance = "impurity", write.forest = TRUE, min.node.size = fit_r1$min.node.size, splitrule = "gini", verbose = TRUE, classification = TRUE)
##
## Type: Classification
## Number of trees: 100
## Sample size: 47522
## Number of independent variables: 38
## Mtry: 6
## Target node size: 1
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error: 18.91 %
## Length Class Mode
## predictions 47522 factor numeric
## num.trees 1 -none- numeric
## num.independent.variables 1 -none- numeric
## mtry 1 -none- numeric
## min.node.size 1 -none- numeric
## variable.importance 38 -none- numeric
## prediction.error 1 -none- numeric
## forest 10 ranger.forest list
## confusion.matrix 9 table numeric
## splitrule 1 -none- character
## treetype 1 -none- character
## call 10 -none- call
## importance.mode 1 -none- character
## num.samples 1 -none- numeric
## replace 1 -none- logical
vars_imp2 <- fit_r2$variable.importance
vars_imp2 <- as.data.frame(vars_imp2)
vars_imp2$myvar <- rownames(vars_imp2)
library(ggpubr) # already loaded at the top
ggbarplot(vars_imp2[1:10,],
x = "myvar", y = "vars_imp2",
#fill = 'myvar',
color = "blue", # bar border color
palette = "jco", # jco journal color palette, see ?ggpar
sort.val = "asc", # sort values in ascending order
sort.by.groups = FALSE, # don't sort inside each group
x.text.angle = 90, # rotate x-axis text vertically
ylab = "Importancia",
xlab = 'Variable',
#legend.title = "MPG Group",
rotate = TRUE,
ggtheme = theme_minimal(),
main = "ranger with datMod2"
)
## Ranger result
##
## Call:
## ranger(status_group ~ ., data = my_train2, num.trees = fit_r1$num.trees, importance = "impurity", write.forest = TRUE, min.node.size = fit_r1$min.node.size, splitrule = "gini", verbose = TRUE, classification = TRUE)
##
## Type: Classification
## Number of trees: 100
## Sample size: 47522
## Number of independent variables: 38
## Mtry: 6
## Target node size: 1
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error: 18.91 %
validation.fit_r2 <- predict(fit_r2, data = my_test2)
confusionMatrix( my_test2$status_group, validation.fit_r2$predictions)
## Confusion Matrix and Statistics
##
## Reference
## Prediction functional functional needs repair non functional
## functional 5805 140 506
## functional needs repair 443 289 131
## non functional 954 86 3524
##
## Overall Statistics
##
## Accuracy : 0.8097
## 95% CI : (0.8026, 0.8168)
## No Information Rate : 0.6063
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.643
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: functional Class: functional needs repair
## Sensitivity 0.8060 0.56117
## Specificity 0.8618 0.94949
## Pos Pred Value 0.8999 0.33488
## Neg Pred Value 0.7426 0.97948
## Prevalence 0.6063 0.04336
## Detection Rate 0.4887 0.02433
## Detection Prevalence 0.5431 0.07266
## Balanced Accuracy 0.8339 0.75533
## Class: non functional
## Sensitivity 0.8469
## Specificity 0.8652
## Pos Pred Value 0.7721
## Neg Pred Value 0.9129
## Prevalence 0.3503
## Detection Rate 0.2967
## Detection Prevalence 0.3842
## Balanced Accuracy 0.8561
set.seed(1)
my_train_l <- copy(my_train2)
my_test_l <- copy(my_test2)
# caret did not accept the original factor labels
levels(my_train_l$status_group) <- c('f','fnr','nf')
levels(my_test_l$status_group) <- c('f','fnr','nf')
n.folds <- 5
cvIndex <- createFolds(factor(my_train_l$status_group), n.folds, returnTrain = T)
trainControl <- trainControl(index = cvIndex,
method = "cv",
number = n.folds,
classProbs = TRUE
)
rf.grid<-expand.grid( mtry = fit_r1$mtry,
splitrule = 'gini',
min.node.size = fit_r1$min.node.size
)
tic()
set.seed(1)
fit_r21 <- train(
status_group ~.,
data = my_train_l,
method = "ranger",
metric = "Accuracy",
importance = 'impurity',
tuneGrid = rf.grid,
num.trees = fit_r1$num.trees,
trControl = trainControl,
verbose = TRUE
)
toc()
## 138.356 sec elapsed
validation.fit_r21 <- predict(fit_r21, newdata = my_test_l)
confusionMatrix( my_test_l$status_group, validation.fit_r21)
## Confusion Matrix and Statistics
##
## Reference
## Prediction f fnr nf
## f 6004 19 428
## fnr 645 81 137
## nf 1581 20 2963
##
## Overall Statistics
##
## Accuracy : 0.7617
## 95% CI : (0.754, 0.7694)
## No Information Rate : 0.6929
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5318
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: f Class: fnr Class: nf
## Sensitivity 0.7295 0.675000 0.8399
## Specificity 0.8775 0.933492 0.8083
## Pos Pred Value 0.9307 0.093859 0.6492
## Neg Pred Value 0.5898 0.996459 0.9228
## Prevalence 0.6929 0.010103 0.2970
## Detection Rate 0.5055 0.006819 0.2495
## Detection Prevalence 0.5431 0.072655 0.3842
## Balanced Accuracy 0.8035 0.804246 0.8241
## ranger variable importance
##
## only 20 most important variables shown (out of 434)
##
## Overall
## waterpoint_type_groupother 100.00
## quantity_groupenough 94.46
## quantityenough 85.29
## longitude 84.45
## waterpoint_typeother 79.33
## latitude 73.81
## extraction_type_groupother 71.87
## extraction_type_classother 70.01
## gps_height 58.58
## extraction_typeother 57.79
## fe_amount_tsh2 55.69
## fe_population2 45.94
## payment_typenever pay 45.46
## waterpoint_typecommunal standpipe 37.96
## extraction_typegravity 33.18
## waterpoint_typehand pump 33.01
## waterpoint_type_grouphand pump 28.88
## quantityinsufficient 28.64
## managementvwc 28.62
## extraction_type_groupgravity 25.93
tic()
set.seed(1)
my_train2 %>% mutate_if(is.character, as.factor) -> my_train_f
my_test2 %>% mutate_if(is.character, as.factor) -> my_test_f
fit_g2 <- gbm.fit(my_train_f[ ,c(1:38)], my_train_f$status_group,
distribution = "multinomial",
n.trees = fit_r1$num.trees,
interaction.depth = 2,
n.minobsinnode = fit_r1$min.node.size,
shrinkage = 0.01,
bag.fraction = 1
)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.0986 nan 0.0100 nan
## 2 1.0688 nan 0.0100 nan
## 3 1.0402 nan 0.0100 nan
## 4 1.0125 nan 0.0100 nan
## 5 0.9859 nan 0.0100 nan
## 6 0.9601 nan 0.0100 nan
## 7 0.9353 nan 0.0100 nan
## 8 0.9113 nan 0.0100 nan
## 9 0.8880 nan 0.0100 nan
## 10 0.8656 nan 0.0100 nan
## 20 0.6758 nan 0.0100 nan
## 40 0.4255 nan 0.0100 nan
## 60 0.2746 nan 0.0100 nan
## 80 0.1797 nan 0.0100 nan
## 100 0.1186 nan 0.0100 nan
## 32.964 sec elapsed
## A gradient boosted model with multinomial loss function.
## 100 iterations were performed.
## There were 38 predictors of which 10 had non-zero influence.
validation.fit_g2 = predict.gbm(fit_g2,n.trees=100, newdata=my_test_f[ ,c(1:38)],type='response')
p.validation.fit_g2 <- apply(validation.fit_g2, 1, which.max)
validation.fit_g2 <- as.factor( colnames(validation.fit_g2)[p.validation.fit_g2])
confusionMatrix( validation.fit_g2, my_test_f$status_group)
## Confusion Matrix and Statistics
##
## Reference
## Prediction functional functional needs repair non functional
## functional 6451 0 0
## functional needs repair 0 863 0
## non functional 0 0 4564
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9997, 1)
## No Information Rate : 0.5431
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: functional Class: functional needs repair
## Sensitivity 1.0000 1.00000
## Specificity 1.0000 1.00000
## Pos Pred Value 1.0000 1.00000
## Neg Pred Value 1.0000 1.00000
## Prevalence 0.5431 0.07266
## Detection Rate 0.5431 0.07266
## Detection Prevalence 0.5431 0.07266
## Balanced Accuracy 1.0000 1.00000
## Class: non functional
## Sensitivity 1.0000
## Specificity 1.0000
## Pos Pred Value 1.0000
## Neg Pred Value 1.0000
## Prevalence 0.3842
## Detection Rate 0.3842
## Detection Prevalence 0.3842
## Balanced Accuracy 1.0000
my_train_l <- copy(my_train2)
my_test_l <- copy(my_test2)
# caret did not accept the original factor labels
levels(my_train_l$status_group) <- c('f','fnr','nf')
levels(my_test_l$status_group) <- c('f','fnr','nf')
tic()
set.seed(1)
n.folds <- 5
cvIndex <- createFolds(factor(my_train_l$status_group), n.folds, returnTrain = T)
trainControl <- trainControl(index = cvIndex,
method = "cv",
number = n.folds,
classProbs = TRUE
)
gbmgrid<-expand.grid(shrinkage=c(0.01),
n.minobsinnode=c(fit_r1$min.node.size),
n.trees=c(fit_r1$num.trees),
interaction.depth=c(2))
set.seed(1)
fit_g21 <- train(
status_group~.,
data = my_train_l,
method = "gbm",
metric = "Accuracy",
bag.fraction=1,
distribution="multinomial",
trControl = trainControl,
tuneGrid=gbmgrid,
verbose = TRUE
)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.0986 nan 0.0100 nan
## 2 1.0904 nan 0.0100 nan
## 3 1.0824 nan 0.0100 nan
## 4 1.0747 nan 0.0100 nan
## 5 1.0672 nan 0.0100 nan
## 6 1.0599 nan 0.0100 nan
## 7 1.0529 nan 0.0100 nan
## 8 1.0460 nan 0.0100 nan
## 9 1.0394 nan 0.0100 nan
## 10 1.0329 nan 0.0100 nan
## 20 0.9777 nan 0.0100 nan
## 40 0.9030 nan 0.0100 nan
## 60 0.8564 nan 0.0100 nan
## 80 0.8245 nan 0.0100 nan
## 100 0.8013 nan 0.0100 nan
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.0986 nan 0.0100 nan
## 2 1.0904 nan 0.0100 nan
## 3 1.0824 nan 0.0100 nan
## 4 1.0747 nan 0.0100 nan
## 5 1.0672 nan 0.0100 nan
## 6 1.0599 nan 0.0100 nan
## 7 1.0529 nan 0.0100 nan
## 8 1.0460 nan 0.0100 nan
## 9 1.0394 nan 0.0100 nan
## 10 1.0329 nan 0.0100 nan
## 20 0.9776 nan 0.0100 nan
## 40 0.9027 nan 0.0100 nan
## 60 0.8556 nan 0.0100 nan
## 80 0.8236 nan 0.0100 nan
## 100 0.8005 nan 0.0100 nan
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 78: scheme_managementNone has no variation.
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.0986 nan 0.0100 nan
## 2 1.0904 nan 0.0100 nan
## 3 1.0824 nan 0.0100 nan
## 4 1.0746 nan 0.0100 nan
## 5 1.0671 nan 0.0100 nan
## 6 1.0599 nan 0.0100 nan
## 7 1.0528 nan 0.0100 nan
## 8 1.0460 nan 0.0100 nan
## 9 1.0393 nan 0.0100 nan
## 10 1.0329 nan 0.0100 nan
## 20 0.9776 nan 0.0100 nan
## 40 0.9025 nan 0.0100 nan
## 60 0.8556 nan 0.0100 nan
## 80 0.8239 nan 0.0100 nan
## 100 0.8010 nan 0.0100 nan
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 48: region_code40 has no variation.
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.0986 nan 0.0100 nan
## 2 1.0903 nan 0.0100 nan
## 3 1.0823 nan 0.0100 nan
## 4 1.0746 nan 0.0100 nan
## 5 1.0671 nan 0.0100 nan
## 6 1.0598 nan 0.0100 nan
## 7 1.0527 nan 0.0100 nan
## 8 1.0459 nan 0.0100 nan
## 9 1.0392 nan 0.0100 nan
## 10 1.0328 nan 0.0100 nan
## 20 0.9774 nan 0.0100 nan
## 40 0.9028 nan 0.0100 nan
## 60 0.8562 nan 0.0100 nan
## 80 0.8244 nan 0.0100 nan
## 100 0.8010 nan 0.0100 nan
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.0986 nan 0.0100 nan
## 2 1.0904 nan 0.0100 nan
## 3 1.0824 nan 0.0100 nan
## 4 1.0747 nan 0.0100 nan
## 5 1.0673 nan 0.0100 nan
## 6 1.0600 nan 0.0100 nan
## 7 1.0530 nan 0.0100 nan
## 8 1.0462 nan 0.0100 nan
## 9 1.0396 nan 0.0100 nan
## 10 1.0332 nan 0.0100 nan
## 20 0.9781 nan 0.0100 nan
## 40 0.9036 nan 0.0100 nan
## 60 0.8572 nan 0.0100 nan
## 80 0.8253 nan 0.0100 nan
## 100 0.8029 nan 0.0100 nan
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.0986 nan 0.0100 nan
## 2 1.0904 nan 0.0100 nan
## 3 1.0824 nan 0.0100 nan
## 4 1.0747 nan 0.0100 nan
## 5 1.0672 nan 0.0100 nan
## 6 1.0599 nan 0.0100 nan
## 7 1.0529 nan 0.0100 nan
## 8 1.0460 nan 0.0100 nan
## 9 1.0394 nan 0.0100 nan
## 10 1.0329 nan 0.0100 nan
## 20 0.9777 nan 0.0100 nan
## 40 0.9029 nan 0.0100 nan
## 60 0.8565 nan 0.0100 nan
## 80 0.8246 nan 0.0100 nan
## 100 0.8015 nan 0.0100 nan
## 1402.459 sec elapsed
validation.fit_g21 <- predict(fit_g21, newdata = my_test_l)
confusionMatrix( my_test_l$status_group, validation.fit_g21)
## Confusion Matrix and Statistics
##
## Reference
## Prediction f fnr nf
## f 5901 0 550
## fnr 733 0 130
## nf 2381 0 2183
##
## Overall Statistics
##
## Accuracy : 0.6806
## 95% CI : (0.6721, 0.689)
## No Information Rate : 0.759
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.355
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: f Class: fnr Class: nf
## Sensitivity 0.6546 NA 0.7625
## Specificity 0.8079 0.92734 0.7359
## Pos Pred Value 0.9147 NA 0.4783
## Neg Pred Value 0.4262 NA 0.9070
## Prevalence 0.7590 0.00000 0.2410
## Detection Rate 0.4968 0.00000 0.1838
## Detection Prevalence 0.5431 0.07266 0.3842
## Balanced Accuracy 0.7312 NA 0.7492
## gbm variable importance
##
## only 20 most important variables shown (out of 434)
##
## Overall
## waterpoint_typeother 100.0000
## quantityenough 74.9895
## extraction_typeother 58.6767
## fe_amount_tsh2 23.0301
## waterpoint_typecommunal standpipe multiple 17.7459
## quantityinsufficient 14.9790
## longitude 10.3591
## water_qualityunknown 4.7173
## payment_typenever pay 3.5205
## fe_lgaBariadi 2.7543
## managementvwc 2.6633
## extraction_typegravity 2.4734
## regionMwanza 2.1422
## fe_lgaKigoma Rural 1.7873
## sourcerainwater harvesting 1.7284
## extraction_typenira/tanira 1.4516
## fe_funderGovernment Of Tanzania 1.3614
## paymentunknown 0.8093
## fe_lgaChunya 0.6117
## sourcemachine dbh 0.6080
Let us see how balanced the categorical variables basin and region are by exploring their frequencies.
| Var1 | Freq |
|---|---|
| Internal | 13.11 |
| Lake Nyasa | 8.56 |
| Lake Rukwa | 4.13 |
| Lake Tanganyika | 10.83 |
| Lake Victoria | 17.25 |
| Pangani | 15.05 |
| Rufiji | 13.43 |
| Ruvuma / Southern Coast | 7.56 |
| Wami / Ruvu | 10.08 |
| Var1 | Freq |
|---|---|
| Arusha | 5.64 |
| Dar es Salaam | 1.36 |
| Dodoma | 3.71 |
| Iringa | 8.91 |
| Kagera | 5.58 |
| Kigoma | 4.74 |
| Kilimanjaro | 7.37 |
| Lindi | 2.60 |
| Manyara | 2.66 |
| Mara | 3.31 |
| Mbeya | 7.81 |
| Morogoro | 6.74 |
| Mtwara | 2.91 |
| Mwanza | 5.22 |
| Pwani | 4.44 |
| Rukwa | 3.04 |
| Ruvuma | 4.44 |
| Shinyanga | 8.39 |
| Singida | 3.52 |
| Tabora | 3.30 |
| Tanga | 4.29 |
Distribution of basin. How many different districts are there in the dataset?
In this version, each category is replaced by its frequency instead of using one-hot encoding.
datMod3 = copy(datMod2)
predict_x3 <- copy(predict_x2)
# Convert the basin and region variables to frequency (count) encoding
datMod3[ , fe_basin := .N , by = .(basin)]
datMod3[ , fe_region := .N , by = .(region)]
to_rem <- c('basin', 'region')
datMod3[ , (to_rem) := NULL]
predict_x3[ , fe_basin := .N , by = .(basin)]
predict_x3[ , fe_region := .N , by = .(region)]
to_rem <- c('basin', 'region')
predict_x3[ , (to_rem) := NULL]
Same algorithms, hyperparameters and resampling as iteration B; only the transformations applied to the train and test sets change.
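Count (frequency) encoding as done above with data.table's `.N` can equivalently be written for a plain character vector; a sketch under that assumption — `count_encode` is illustrative, not part of the original script:

```r
# Hypothetical helper: replace each categorical value by its level count.
count_encode <- function(x) {
  counts <- table(x)                    # occurrences of each level
  as.integer(counts[as.character(x)])   # look up each value's frequency
}

# e.g. datMod3$fe_basin <- count_encode(datMod3$basin)
```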
Ranger improves slightly on iteration B: 0.8147 vs 0.8097 without resampling, and 0.7653 vs 0.7617 with stratified k-folds.
# Split out validation dataset
# create a list of 80% of the rows in the original dataset we can use for training
set.seed(1)
validationIndex <- createDataPartition(datMod3$status_group, p=0.80, list=FALSE)
# select 20% of the data for validation
my_test3 <-datMod3[-validationIndex,]
# use the remaining 80% of the data for training
my_train3 <-datMod3[validationIndex,]
tic()
set.seed(1)
fit_r3 <- ranger(
status_group ~. ,
data = my_train3,
num.trees = fit_r1$num.trees,
importance = 'impurity',
write.forest = TRUE,
min.node.size = fit_r1$min.node.size,
splitrule = "gini",
verbose = TRUE,
classification = TRUE
)
toc()
## 13.173 sec elapsed
## Ranger result
##
## Call:
## ranger(status_group ~ ., data = my_train3, num.trees = fit_r1$num.trees, importance = "impurity", write.forest = TRUE, min.node.size = fit_r1$min.node.size, splitrule = "gini", verbose = TRUE, classification = TRUE)
##
## Type: Classification
## Number of trees: 100
## Sample size: 47522
## Number of independent variables: 38
## Mtry: 6
## Target node size: 1
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error: 18.96 %
## Length Class Mode
## predictions 47522 factor numeric
## num.trees 1 -none- numeric
## num.independent.variables 1 -none- numeric
## mtry 1 -none- numeric
## min.node.size 1 -none- numeric
## variable.importance 38 -none- numeric
## prediction.error 1 -none- numeric
## forest 10 ranger.forest list
## confusion.matrix 9 table numeric
## splitrule 1 -none- character
## treetype 1 -none- character
## call 10 -none- call
## importance.mode 1 -none- character
## num.samples 1 -none- numeric
## replace 1 -none- logical
vars_imp3 <- fit_r3$variable.importance
vars_imp3 <- as.data.frame(vars_imp3)
vars_imp3$myvar <- rownames(vars_imp3)
library(ggpubr) # already loaded at the top
ggbarplot(vars_imp3[1:10,],
x = "myvar", y = "vars_imp3",
#fill = 'myvar',
color = "blue", # bar border color
palette = "jco", # jco journal color palette, see ?ggpar
sort.val = "asc", # sort values in ascending order
sort.by.groups = FALSE, # don't sort inside each group
x.text.angle = 90, # rotate x-axis text vertically
ylab = "Importancia",
xlab = 'Variable',
#legend.title = "MPG Group",
rotate = TRUE,
ggtheme = theme_minimal(),
main = "ranger with datMod3"
)
## Ranger result
##
## Call:
## ranger(status_group ~ ., data = my_train3, num.trees = fit_r1$num.trees, importance = "impurity", write.forest = TRUE, min.node.size = fit_r1$min.node.size, splitrule = "gini", verbose = TRUE, classification = TRUE)
##
## Type: Classification
## Number of trees: 100
## Sample size: 47522
## Number of independent variables: 38
## Mtry: 6
## Target node size: 1
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error: 18.96 %
validation.fit_r3 <- predict(fit_r3, data = my_test3)
confusionMatrix( my_test3$status_group, validation.fit_r3$predictions)
## Confusion Matrix and Statistics
##
## Reference
## Prediction functional functional needs repair non functional
## functional 5805 164 482
## functional needs repair 433 304 126
## non functional 925 71 3568
##
## Overall Statistics
##
## Accuracy : 0.8147
## 95% CI : (0.8076, 0.8217)
## No Information Rate : 0.603
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6531
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: functional Class: functional needs repair
## Sensitivity 0.8104 0.56401
## Specificity 0.8630 0.95070
## Pos Pred Value 0.8999 0.35226
## Neg Pred Value 0.7498 0.97867
## Prevalence 0.6030 0.04538
## Detection Rate 0.4887 0.02559
## Detection Prevalence 0.5431 0.07266
## Balanced Accuracy 0.8367 0.75735
## Class: non functional
## Sensitivity 0.8544
## Specificity 0.8707
## Pos Pred Value 0.7818
## Neg Pred Value 0.9169
## Prevalence 0.3516
## Detection Rate 0.3004
## Detection Prevalence 0.3842
## Balanced Accuracy 0.8625
prediction.fit_r3 <- predict(fit_r3, data = predict_x3)
file.prediction.fit_r3 <- cbind.data.frame(test_id, prediction.fit_r3$predictions)
colnames(file.prediction.fit_r3) <- c("id","status_group")
fwrite(file.prediction.fit_r3, file = "predi_fit_r3.csv", sep = ",")
my_train_l <- copy(my_train3)
my_test_l <- copy(my_test3)
# caret did not accept the original factor labels
levels(my_train_l$status_group) <- c('f','fnr','nf')
levels(my_test_l$status_group) <- c('f','fnr','nf')
n.folds <- 5
cvIndex <- createFolds(factor(my_train_l$status_group), n.folds, returnTrain = T)
trainControl <- trainControl(index = cvIndex,
method = "cv",
number = n.folds,
classProbs = TRUE
)
rf.grid<-expand.grid( mtry = fit_r1$mtry,
splitrule = 'gini',
min.node.size = fit_r1$min.node.size
)
tic()
set.seed(1)
fit_r31 <- train(
status_group ~.,
data = my_train_l,
method = "ranger",
metric = "Accuracy",
importance = 'impurity',
tuneGrid = rf.grid,
num.trees = fit_r1$num.trees,
trControl = trainControl,
verbose = TRUE
)
toc()
## 134.237 sec elapsed
validation.fit_r31 <- predict(fit_r31, newdata = my_test_l)
confusionMatrix( my_test_l$status_group, validation.fit_r31)
## Confusion Matrix and Statistics
##
## Reference
## Prediction f fnr nf
## f 6010 19 422
## fnr 644 76 143
## nf 1543 17 3004
##
## Overall Statistics
##
## Accuracy : 0.7653
## 95% CI : (0.7576, 0.7729)
## No Information Rate : 0.6901
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5389
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: f Class: fnr Class: nf
## Sensitivity 0.7332 0.678571 0.8417
## Specificity 0.8802 0.933112 0.8123
## Pos Pred Value 0.9316 0.088065 0.6582
## Neg Pred Value 0.5970 0.996732 0.9228
## Prevalence 0.6901 0.009429 0.3005
## Detection Rate 0.5060 0.006398 0.2529
## Detection Prevalence 0.5431 0.072655 0.3842
## Balanced Accuracy 0.8067 0.805842 0.8270
## ranger variable importance
##
## only 20 most important variables shown (out of 408)
##
## Overall
## quantityenough 100.00
## quantity_groupenough 92.61
## extraction_type_groupother 89.40
## extraction_typeother 87.50
## waterpoint_type_groupother 86.69
## longitude 86.10
## latitude 84.50
## extraction_type_classother 75.65
## gps_height 73.17
## waterpoint_typeother 63.85
## fe_amount_tsh2 60.78
## fe_region 52.64
## fe_population2 45.94
## fe_basin 44.45
## payment_typenever pay 42.32
## waterpoint_typecommunal standpipe 40.42
## waterpoint_typehand pump 36.96
## extraction_typegravity 36.17
## extraction_type_classhandpump 33.37
## quantity_groupinsufficient 32.77
tic()
set.seed(1)
my_train3 %>% mutate_if(is.character, as.factor) -> my_train_f
my_test3 %>% mutate_if(is.character, as.factor) -> my_test_f
fit_g3 <- gbm.fit(my_train_f[ ,c(1:38)], my_train_f$status_group,
distribution = "multinomial",
n.trees = fit_r1$num.trees,
interaction.depth = 2,
n.minobsinnode = fit_r1$min.node.size,
shrinkage = 0.01,
bag.fraction = 1
)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.0986 nan 0.0100 nan
## 2 1.0688 nan 0.0100 nan
## 3 1.0402 nan 0.0100 nan
## 4 1.0125 nan 0.0100 nan
## 5 0.9859 nan 0.0100 nan
## 6 0.9601 nan 0.0100 nan
## 7 0.9353 nan 0.0100 nan
## 8 0.9113 nan 0.0100 nan
## 9 0.8880 nan 0.0100 nan
## 10 0.8656 nan 0.0100 nan
## 20 0.6758 nan 0.0100 nan
## 40 0.4255 nan 0.0100 nan
## 60 0.2746 nan 0.0100 nan
## 80 0.1797 nan 0.0100 nan
## 100 0.1186 nan 0.0100 nan
## 32.214 sec elapsed
## A gradient boosted model with multinomial loss function.
## 100 iterations were performed.
## There were 38 predictors of which 11 had non-zero influence.
validation.fit_g3 = predict.gbm(fit_g3,n.trees=100, newdata=my_test_f[ ,c(1:38)],type='response')
p.validation.fit_g3 <- apply(validation.fit_g3, 1, which.max)
validation.fit_g3 <- as.factor( colnames(validation.fit_g3)[p.validation.fit_g3])
confusionMatrix( validation.fit_g3, my_test_f$status_group)
## Confusion Matrix and Statistics
##
## Reference
## Prediction functional functional needs repair non functional
## functional 6451 0 0
## functional needs repair 0 863 0
## non functional 0 0 4564
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9997, 1)
## No Information Rate : 0.5431
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: functional Class: functional needs repair
## Sensitivity 1.0000 1.00000
## Specificity 1.0000 1.00000
## Pos Pred Value 1.0000 1.00000
## Neg Pred Value 1.0000 1.00000
## Prevalence 0.5431 0.07266
## Detection Rate 0.5431 0.07266
## Detection Prevalence 0.5431 0.07266
## Balanced Accuracy 1.0000 1.00000
## Class: non functional
## Sensitivity 1.0000
## Specificity 1.0000
## Pos Pred Value 1.0000
## Neg Pred Value 1.0000
## Prevalence 0.3842
## Detection Rate 0.3842
## Detection Prevalence 0.3842
## Balanced Accuracy 1.0000
my_train_l <- copy(my_train3)
my_test_l <- copy(my_test3)
# caret did not accept the original factor labels
levels(my_train_l$status_group) <- c('f','fnr','nf')
levels(my_test_l$status_group) <- c('f','fnr','nf')
tic()
set.seed(1)
n.folds <- 5
cvIndex <- createFolds(factor(my_train_l$status_group), n.folds, returnTrain = T)
trainControl <- trainControl(index = cvIndex,
method = "cv",
number = n.folds,
classProbs = TRUE
)
gbmgrid<-expand.grid(shrinkage=c(0.01),
n.minobsinnode=c(fit_r1$min.node.size),
n.trees=c(fit_r1$num.trees),
interaction.depth=c(2))
set.seed(1)
fit_g31 <- train(
status_group~.,
data = my_train_l,
method = "gbm",
metric = "Accuracy",
bag.fraction=1,
distribution="multinomial",
trControl = trainControl,
tuneGrid=gbmgrid,
verbose = TRUE
)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.0986 nan 0.0100 nan
## 2 1.0904 nan 0.0100 nan
## 3 1.0824 nan 0.0100 nan
## 4 1.0747 nan 0.0100 nan
## 5 1.0672 nan 0.0100 nan
## 6 1.0600 nan 0.0100 nan
## 7 1.0530 nan 0.0100 nan
## 8 1.0462 nan 0.0100 nan
## 9 1.0395 nan 0.0100 nan
## 10 1.0331 nan 0.0100 nan
## 20 0.9781 nan 0.0100 nan
## 40 0.9036 nan 0.0100 nan
## 60 0.8572 nan 0.0100 nan
## 80 0.8251 nan 0.0100 nan
## 100 0.8022 nan 0.0100 nan
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.0986 nan 0.0100 nan
## 2 1.0904 nan 0.0100 nan
## 3 1.0824 nan 0.0100 nan
## 4 1.0747 nan 0.0100 nan
## 5 1.0672 nan 0.0100 nan
## 6 1.0599 nan 0.0100 nan
## 7 1.0529 nan 0.0100 nan
## 8 1.0460 nan 0.0100 nan
## 9 1.0394 nan 0.0100 nan
## 10 1.0330 nan 0.0100 nan
## 20 0.9778 nan 0.0100 nan
## 40 0.9032 nan 0.0100 nan
## 60 0.8566 nan 0.0100 nan
## 80 0.8247 nan 0.0100 nan
## 100 0.8024 nan 0.0100 nan
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.0986 nan 0.0100 nan
## 2 1.0904 nan 0.0100 nan
## 3 1.0825 nan 0.0100 nan
## 4 1.0748 nan 0.0100 nan
## 5 1.0674 nan 0.0100 nan
## 6 1.0602 nan 0.0100 nan
## 7 1.0532 nan 0.0100 nan
## 8 1.0464 nan 0.0100 nan
## 9 1.0398 nan 0.0100 nan
## 10 1.0334 nan 0.0100 nan
## 20 0.9785 nan 0.0100 nan
## 40 0.9039 nan 0.0100 nan
## 60 0.8571 nan 0.0100 nan
## 80 0.8249 nan 0.0100 nan
## 100 0.8024 nan 0.0100 nan
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 20: region_code40 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 50: scheme_managementNone has no variation.
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.0986 nan 0.0100 nan
## 2 1.0904 nan 0.0100 nan
## 3 1.0825 nan 0.0100 nan
## 4 1.0749 nan 0.0100 nan
## 5 1.0675 nan 0.0100 nan
## 6 1.0603 nan 0.0100 nan
## 7 1.0533 nan 0.0100 nan
## 8 1.0465 nan 0.0100 nan
## 9 1.0400 nan 0.0100 nan
## 10 1.0336 nan 0.0100 nan
## 20 0.9789 nan 0.0100 nan
## 40 0.9046 nan 0.0100 nan
## 60 0.8578 nan 0.0100 nan
## 80 0.8259 nan 0.0100 nan
## 100 0.8033 nan 0.0100 nan
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 70: extraction_typeother - mkulima/shinyanga has no
## variation.
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.0986 nan 0.0100 nan
## 2 1.0905 nan 0.0100 nan
## 3 1.0826 nan 0.0100 nan
## 4 1.0749 nan 0.0100 nan
## 5 1.0675 nan 0.0100 nan
## 6 1.0604 nan 0.0100 nan
## 7 1.0534 nan 0.0100 nan
## 8 1.0466 nan 0.0100 nan
## 9 1.0401 nan 0.0100 nan
## 10 1.0337 nan 0.0100 nan
## 20 0.9790 nan 0.0100 nan
## 40 0.9046 nan 0.0100 nan
## 60 0.8581 nan 0.0100 nan
## 80 0.8257 nan 0.0100 nan
## 100 0.8029 nan 0.0100 nan
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.0986 nan 0.0100 nan
## 2 1.0904 nan 0.0100 nan
## 3 1.0825 nan 0.0100 nan
## 4 1.0748 nan 0.0100 nan
## 5 1.0674 nan 0.0100 nan
## 6 1.0601 nan 0.0100 nan
## 7 1.0531 nan 0.0100 nan
## 8 1.0464 nan 0.0100 nan
## 9 1.0398 nan 0.0100 nan
## 10 1.0334 nan 0.0100 nan
## 20 0.9785 nan 0.0100 nan
## 40 0.9041 nan 0.0100 nan
## 60 0.8574 nan 0.0100 nan
## 80 0.8252 nan 0.0100 nan
## 100 0.8026 nan 0.0100 nan
## 1328.609 sec elapsed
validation.fit_g31 <- predict(fit_g31, newdata = my_test_l)
# Note: caret's confusionMatrix() signature is (data = predictions, reference = truth);
# here the arguments are reversed, so "Prevalence" and the per-class statistics
# below are computed with respect to the predictions rather than the true labels.
confusionMatrix(my_test_l$status_group, validation.fit_g31)
## Confusion Matrix and Statistics
##
## Reference
## Prediction f fnr nf
## f 5842 2 607
## fnr 738 1 124
## nf 2362 5 2197
##
## Overall Statistics
##
## Accuracy : 0.6769
## 95% CI : (0.6684, 0.6853)
## No Information Rate : 0.7528
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.349
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: f Class: fnr Class: nf
## Sensitivity 0.6533 1.250e-01 0.7503
## Specificity 0.7926 9.274e-01 0.7355
## Pos Pred Value 0.9056 1.159e-03 0.4814
## Neg Pred Value 0.4288 9.994e-01 0.9001
## Prevalence 0.7528 6.735e-04 0.2465
## Detection Rate 0.4918 8.419e-05 0.1850
## Detection Prevalence 0.5431 7.266e-02 0.3842
## Balanced Accuracy 0.7229 5.262e-01 0.7429
## gbm variable importance
##
## only 20 most important variables shown (out of 408)
##
## Overall
## waterpoint_typeother 100.0000
## quantityenough 76.3676
## extraction_typeother 56.0496
## fe_amount_tsh2 22.4840
## waterpoint_typecommunal standpipe multiple 20.9026
## quantityinsufficient 15.3666
## longitude 9.6876
## payment_typenever pay 5.4194
## water_qualityunknown 4.3711
## fe_region 4.1316
## fe_lgaBariadi 3.2110
## fe_lgaKigoma Rural 2.7922
## managementvwc 2.1924
## fe_funderGovernment Of Tanzania 2.0766
## extraction_typegravity 1.8127
## region_code19 1.8001
## extraction_typenira/tanira 0.9558
## sourcerainwater harvesting 0.6026
## fe_basin 0.5972
## fe_public_meetingDesconocido 0.3309
The best result was produced by a Random Forest model fitted with the ranger library. The best iteration was option B, in which the numeric variables were included and lumping was applied to the categorical variables with many levels. This option reached a score of 0.8174 on the competition platform (see attached image).
The next stage should be tuning the model, exploring changes to the hyperparameters of the best model; due to computational and time limits, I was not able to complete that second stage.
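As a rough sketch of that pending stage, the `expand.grid` call used above could be widened to cover several hyperparameter combinations instead of fixing them; the grid values below are hypothetical, chosen only to illustrate the idea, and `gbmgrid_tune` is a name introduced here:

```r
# Hypothetical tuning grid for the gbm model (values are illustrative, not tested):
# caret::train() would evaluate every combination by cross-validation
# and keep the one with the best Accuracy.
gbmgrid_tune <- expand.grid(shrinkage         = c(0.01, 0.05, 0.1),
                            n.minobsinnode    = c(10, 20),
                            n.trees           = c(100, 500, 1000),
                            interaction.depth = c(2, 4, 6))
# fit_g32 <- train(status_group ~ ., data = my_train_l, method = "gbm",
#                  metric = "Accuracy", distribution = "multinomial",
#                  trControl = trainControl, tuneGrid = gbmgrid_tune)
```

With 3 x 2 x 3 x 4 = 72 combinations and 5-fold CV, this grid would be far more expensive than the single fit above, which is exactly the computational limit mentioned.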
Other transformations of the latitude, longitude, and aquifer-elevation fields could be explored in the future; due to time constraints, this part could not be carried out in the present work.
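One possible direction, sketched here only as an assumption about how such a transformation might look: in this dataset some records carry a placeholder longitude of 0 (a point outside Tanzania), and those could be flagged as missing before deriving new geographic features. The function `fix_coords` and its thresholds are hypothetical, not part of the original pipeline:

```r
# Hypothetical coordinate cleaning (illustrative thresholds):
# longitude == 0 is an implausible placeholder for Tanzanian waterpoints,
# and Tanzania lies south of the equator, so latitudes near or above 0
# are treated as missing as well.
fix_coords <- function(dt) {
  dt$longitude[dt$longitude == 0] <- NA
  dt$latitude[dt$latitude > -0.5] <- NA
  dt
}
```

The resulting `NA`s could then be imputed (for example, with the median coordinates of the corresponding region) before computing derived features such as distances.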